Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Forest Cover Type dataset presents a multi-class classification problem in which we try to predict one of seven possible cover types.
INTRODUCTION: This experiment tries to predict forest cover type from cartographic variables only. This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.
The actual forest cover type for a given observation (30 x 30 meter cell) was determined from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).
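The binary indicator columns described above can be collapsed back into a single categorical variable when that is more convenient for summaries or plots. The following is a minimal sketch on toy data, not part of the original script. It assumes that exactly one indicator is set in every row, and the helper name `collapse_one_hot` and the toy data frame are hypothetical.

```r
# Toy data: three one-hot columns standing in for the Wilderness_Area* columns.
toy <- data.frame(
  Wilderness_Area1 = c(1, 0, 0, 1),
  Wilderness_Area2 = c(0, 1, 0, 0),
  Wilderness_Area3 = c(0, 0, 1, 0)
)

# Hypothetical helper: collapse one-hot columns into a single factor.
# Assumes exactly one column equals 1 in every row and stops otherwise.
collapse_one_hot <- function(df) {
  m <- as.matrix(df)
  stopifnot(all(rowSums(m) == 1))
  factor(colnames(m)[max.col(m, ties.method = "first")], levels = colnames(m))
}

area <- collapse_one_hot(toy)
print(table(area))
```

The `stopifnot()` guard makes the one-indicator-per-row assumption explicit, so the helper fails loudly on data that does not satisfy it.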
ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 78.04%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 85.48%. By using the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 86.07%, which was even better than the predictions from the training data.
CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results using the training and testing datasets. For this dataset, Random Forest should be considered for further modeling.
Dataset Used: Covertype Data Set
Dataset ML Model: Multi-Class classification with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Covertype
One source of potential performance benchmarks: https://www.kaggle.com/c/forest-cover-type-prediction/overview
The project aims to touch on the following areas:
Any predictive modeling machine learning project can generally be broken down into about six major tasks:
startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(corrplot)
## corrplot 0.84 loaded
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(mailR)
## Registered S3 method overwritten by 'R.oo':
## method from
## throw.default R.methodsS3
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(stringr)
# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)
# Send a progress notification email; all SMTP settings are read from environment variables
email_notify <- function(msg=""){
  sender <- Sys.getenv("MAIL_SENDER")
  receiver <- Sys.getenv("MAIL_RECEIVER")
  gateway <- Sys.getenv("SMTP_GATEWAY")
  smtpuser <- Sys.getenv("SMTP_USERNAME")
  password <- Sys.getenv("SMTP_PASSWORD")
  sbj_line <- "Notification from R Multi-Class Classification Script"
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = gateway, port = 587, user.name = smtpuser, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
# Set up the notifyStatus flag to control progress emails (setting TRUE will send emails!)
notifyStatus <- FALSE
if (notifyStatus) email_notify(paste("Library and Data Loading has begun!",date()))
# Slicing up the document path to get the final destination file name
dataset_path <- 'https://www.kaggle.com/c/forest-cover-type-prediction/download/train.csv'
doc_path_list <- str_split(dataset_path, "/")
dest_file <- doc_path_list[[1]][length(doc_path_list[[1]])]
if (!file.exists(dest_file)) {
  # Download the document from the website
  cat("Downloading", dataset_path, "as", dest_file, "\n")
  download.file(dataset_path, dest_file, mode = "wb")
  cat(dest_file, "downloaded!\n")
  # unzip(dest_file)
  # cat(dest_file, "unpacked!\n")
}
inputFile <- dest_file
Xy_original <- read.csv(inputFile, sep=',', header=TRUE, row.names=1)
Xy_original$Cover_Type <- as.factor(Xy_original$Cover_Type)
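A download step like the one above can silently save something other than the data, for example an HTML login page when the host requires authentication. As a hedged sketch (not part of the original script), we could check the first line of the file before reading it. The helper name `looks_like_csv` and the expected first column name `"Id"` are assumptions made for illustration.

```r
# Hypothetical helper: TRUE when the first line of `path` starts with the
# expected first column name followed by a comma.
looks_like_csv <- function(path, first_col = "Id") {
  first_line <- readLines(path, n = 1, warn = FALSE)
  length(first_line) == 1 && startsWith(first_line, paste0(first_col, ","))
}

# Demonstration with two temporary files: a small CSV and an HTML page.
good <- tempfile(fileext = ".csv")
bad  <- tempfile(fileext = ".csv")
writeLines(c("Id,Elevation,Aspect", "1,2596,51"), good)
writeLines("<!DOCTYPE html><html>...</html>", bad)

print(looks_like_csv(good))  # TRUE
print(looks_like_csv(bad))   # FALSE
```

Calling such a check right after the download would let the script stop early with a clear message instead of failing later inside `read.csv()`.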
# Take a peek at the dataframe after the import
head(Xy_original)
## Elevation Aspect Slope Horizontal_Distance_To_Hydrology
## 1 2596 51 3 258
## 2 2590 56 2 212
## 3 2804 139 9 268
## 4 2785 155 18 242
## 5 2595 45 2 153
## 6 2579 132 6 300
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 1 0 510
## 2 -6 390
## 3 65 3180
## 4 118 3090
## 5 -1 391
## 6 -15 67
## Hillshade_9am Hillshade_Noon Hillshade_3pm
## 1 221 232 148
## 2 220 235 151
## 3 234 238 135
## 4 238 238 122
## 5 220 234 150
## 6 230 237 140
## Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2
## 1 6279 1 0
## 2 6225 1 0
## 3 6121 1 0
## 4 6211 1 0
## 5 6172 1 0
## 6 6031 1 0
## Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type15
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33
## 1 0 1 0 0 0 0
## 2 0 1 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 1 0 0 0
## 5 0 1 0 0 0 0
## 6 0 1 0 0 0 0
## Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type40 Cover_Type
## 1 0 5
## 2 0 5
## 3 0 2
## 4 0 2
## 5 0 5
## 6 0 2
sapply(Xy_original, class)
## Elevation Aspect
## "integer" "integer"
## Slope Horizontal_Distance_To_Hydrology
## "integer" "integer"
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## "integer" "integer"
## Hillshade_9am Hillshade_Noon
## "integer" "integer"
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## "integer" "integer"
## Wilderness_Area1 Wilderness_Area2
## "integer" "integer"
## Wilderness_Area3 Wilderness_Area4
## "integer" "integer"
## Soil_Type1 Soil_Type2
## "integer" "integer"
## Soil_Type3 Soil_Type4
## "integer" "integer"
## Soil_Type5 Soil_Type6
## "integer" "integer"
## Soil_Type7 Soil_Type8
## "integer" "integer"
## Soil_Type9 Soil_Type10
## "integer" "integer"
## Soil_Type11 Soil_Type12
## "integer" "integer"
## Soil_Type13 Soil_Type14
## "integer" "integer"
## Soil_Type15 Soil_Type16
## "integer" "integer"
## Soil_Type17 Soil_Type18
## "integer" "integer"
## Soil_Type19 Soil_Type20
## "integer" "integer"
## Soil_Type21 Soil_Type22
## "integer" "integer"
## Soil_Type23 Soil_Type24
## "integer" "integer"
## Soil_Type25 Soil_Type26
## "integer" "integer"
## Soil_Type27 Soil_Type28
## "integer" "integer"
## Soil_Type29 Soil_Type30
## "integer" "integer"
## Soil_Type31 Soil_Type32
## "integer" "integer"
## Soil_Type33 Soil_Type34
## "integer" "integer"
## Soil_Type35 Soil_Type36
## "integer" "integer"
## Soil_Type37 Soil_Type38
## "integer" "integer"
## Soil_Type39 Soil_Type40
## "integer" "integer"
## Cover_Type
## "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
## Elevation Aspect
## 0 0
## Slope Horizontal_Distance_To_Hydrology
## 0 0
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 0 0
## Hillshade_9am Hillshade_Noon
## 0 0
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## 0 0
## Wilderness_Area1 Wilderness_Area2
## 0 0
## Wilderness_Area3 Wilderness_Area4
## 0 0
## Soil_Type1 Soil_Type2
## 0 0
## Soil_Type3 Soil_Type4
## 0 0
## Soil_Type5 Soil_Type6
## 0 0
## Soil_Type7 Soil_Type8
## 0 0
## Soil_Type9 Soil_Type10
## 0 0
## Soil_Type11 Soil_Type12
## 0 0
## Soil_Type13 Soil_Type14
## 0 0
## Soil_Type15 Soil_Type16
## 0 0
## Soil_Type17 Soil_Type18
## 0 0
## Soil_Type19 Soil_Type20
## 0 0
## Soil_Type21 Soil_Type22
## 0 0
## Soil_Type23 Soil_Type24
## 0 0
## Soil_Type25 Soil_Type26
## 0 0
## Soil_Type27 Soil_Type28
## 0 0
## Soil_Type29 Soil_Type30
## 0 0
## Soil_Type31 Soil_Type32
## 0 0
## Soil_Type33 Soil_Type34
## 0 0
## Soil_Type35 Soil_Type36
## 0 0
## Soil_Type37 Soil_Type38
## 0 0
## Soil_Type39 Soil_Type40
## 0 0
## Cover_Type
## 0
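The per-column missing-value count above can also be written with `colSums()`, and `anyNA()` gives a quick yes/no answer for the whole data frame. This is a small sketch on a toy data frame, offered as an alternative rather than a correction.

```r
# Toy data frame with one missing value in each of the first two columns.
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)

# Count of missing values per column, equivalent to the sapply() call.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Single logical answer: does the data frame contain any missing value at all?
print(anyNA(toy))
```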
# Not applicable for this iteration of the project.
# Take a peek at the dataframe after the cleaning
head(Xy_original)
## Elevation Aspect Slope Horizontal_Distance_To_Hydrology
## 1 2596 51 3 258
## 2 2590 56 2 212
## 3 2804 139 9 268
## 4 2785 155 18 242
## 5 2595 45 2 153
## 6 2579 132 6 300
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 1 0 510
## 2 -6 390
## 3 65 3180
## 4 118 3090
## 5 -1 391
## 6 -15 67
## Hillshade_9am Hillshade_Noon Hillshade_3pm
## 1 221 232 148
## 2 220 235 151
## 3 234 238 135
## 4 238 238 122
## 5 220 234 150
## 6 230 237 140
## Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2
## 1 6279 1 0
## 2 6225 1 0
## 3 6121 1 0
## 4 6211 1 0
## 5 6172 1 0
## 6 6031 1 0
## Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type15
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33
## 1 0 1 0 0 0 0
## 2 0 1 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 1 0 0 0
## 5 0 1 0 0 0 0
## 6 0 1 0 0 0 0
## Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type40 Cover_Type
## 1 0 5
## 2 0 5
## 3 0 2
## 4 0 2
## 5 0 5
## 6 0 2
sapply(Xy_original, class)
## Elevation Aspect
## "integer" "integer"
## Slope Horizontal_Distance_To_Hydrology
## "integer" "integer"
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## "integer" "integer"
## Hillshade_9am Hillshade_Noon
## "integer" "integer"
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## "integer" "integer"
## Wilderness_Area1 Wilderness_Area2
## "integer" "integer"
## Wilderness_Area3 Wilderness_Area4
## "integer" "integer"
## Soil_Type1 Soil_Type2
## "integer" "integer"
## Soil_Type3 Soil_Type4
## "integer" "integer"
## Soil_Type5 Soil_Type6
## "integer" "integer"
## Soil_Type7 Soil_Type8
## "integer" "integer"
## Soil_Type9 Soil_Type10
## "integer" "integer"
## Soil_Type11 Soil_Type12
## "integer" "integer"
## Soil_Type13 Soil_Type14
## "integer" "integer"
## Soil_Type15 Soil_Type16
## "integer" "integer"
## Soil_Type17 Soil_Type18
## "integer" "integer"
## Soil_Type19 Soil_Type20
## "integer" "integer"
## Soil_Type21 Soil_Type22
## "integer" "integer"
## Soil_Type23 Soil_Type24
## "integer" "integer"
## Soil_Type25 Soil_Type26
## "integer" "integer"
## Soil_Type27 Soil_Type28
## "integer" "integer"
## Soil_Type29 Soil_Type30
## "integer" "integer"
## Soil_Type31 Soil_Type32
## "integer" "integer"
## Soil_Type33 Soil_Type34
## "integer" "integer"
## Soil_Type35 Soil_Type36
## "integer" "integer"
## Soil_Type37 Soil_Type38
## "integer" "integer"
## Soil_Type39 Soil_Type40
## "integer" "integer"
## Cover_Type
## "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
## Elevation Aspect
## 0 0
## Slope Horizontal_Distance_To_Hydrology
## 0 0
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 0 0
## Hillshade_9am Hillshade_Noon
## 0 0
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## 0 0
## Wilderness_Area1 Wilderness_Area2
## 0 0
## Wilderness_Area3 Wilderness_Area4
## 0 0
## Soil_Type1 Soil_Type2
## 0 0
## Soil_Type3 Soil_Type4
## 0 0
## Soil_Type5 Soil_Type6
## 0 0
## Soil_Type7 Soil_Type8
## 0 0
## Soil_Type9 Soil_Type10
## 0 0
## Soil_Type11 Soil_Type12
## 0 0
## Soil_Type13 Soil_Type14
## 0 0
## Soil_Type15 Soil_Type16
## 0 0
## Soil_Type17 Soil_Type18
## 0 0
## Soil_Type19 Soil_Type20
## 0 0
## Soil_Type21 Soil_Type22
## 0 0
## Soil_Type23 Soil_Type24
## 0 0
## Soil_Type25 Soil_Type26
## 0 0
## Soil_Type27 Soil_Type28
## 0 0
## Soil_Type29 Soil_Type30
## 0 0
## Soil_Type31 Soil_Type32
## 0 0
## Soil_Type33 Soil_Type34
## 0 0
## Soil_Type35 Soil_Type36
## 0 0
## Soil_Type37 Soil_Type38
## 0 0
## Soil_Type39 Soil_Type40
## 0 0
## Cover_Type
## 0
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(Xy_original)
# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, be careful when slicing up the dataframes for visualization!
targetCol <- totCol
# Standardize the class column to the name of targetVar if applicable
colnames(Xy_original)[targetCol] <- "targetVar"
# We create training datasets (Xy_train, X_train, y_train) for various visualization and cleaning/transformation operations.
# We create testing datasets (Xy_test, y_test) to evaluate the final model on data it has not seen during training.
set.seed(seedNum)
# Create a list of the rows in the original dataset we can use for training
# Use 70% of the data to train the models and the remaining for testing/validation
training_index <- createDataPartition(Xy_original$targetVar, p=0.70, list=FALSE)
Xy_train <- Xy_original[training_index,]
Xy_test <- Xy_original[-training_index,]
if (targetCol==1) {
  X_train <- Xy_train[,(targetCol+1):totCol]
  y_train <- Xy_train[,targetCol]
  y_test <- Xy_test[,targetCol]
} else {
  X_train <- Xy_train[,1:(totAttr)]
  y_train <- Xy_train[,totCol]
  y_test <- Xy_test[,totCol]
}
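The split above relies on `createDataPartition()`, which samples within each class so that the class proportions carry over to both partitions. The sketch below illustrates the same idea in base R on toy labels. It is not caret's implementation, and the helper name `stratified_index` is hypothetical.

```r
set.seed(888)

# Toy labels: 7 balanced classes with 20 rows each.
y <- factor(rep(1:7, each = 20))

# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_index <- function(y, p = 0.70) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(y, p = 0.70)
print(table(y[train_idx]))   # 14 rows per class
print(table(y[-train_idx]))  # 6 rows per class
```

Because the sampling happens class by class, a balanced dataset stays balanced after the split, which is consistent with the 1,512 rows per class reported for the training set later in this document.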
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 3
if (totAttr%%dispCol == 0) {
  dispRow <- totAttr%/%dispCol
} else {
  dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row): 3 by 18
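The row count for the graphics grid is a round-up division, so the same result can be obtained in one step with `ceiling()`. A short sketch follows, using the attribute count from this dataset (54) and hypothetical snake_case variable names so that it does not clash with the script's own variables.

```r
tot_attr <- 54   # number of attribute columns in this dataset
disp_col <- 3    # number of plot columns in the grid

# Round-up division: smallest row count such that rows * columns >= attributes.
disp_row <- ceiling(tot_attr / disp_col)
print(disp_row)         # 18

# When the division is not exact, one extra row is added.
print(ceiling(55 / 3))  # 19
```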
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
if (notifyStatus) email_notify(paste("Library and Data Loading completed!",date()))
To gain a better understanding of the data that we have on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.
if (notifyStatus) email_notify(paste("Data Summarization and Visualization has begun!",date()))
head(Xy_train)
## Elevation Aspect Slope Horizontal_Distance_To_Hydrology
## 1 2596 51 3 258
## 3 2804 139 9 268
## 4 2785 155 18 242
## 5 2595 45 2 153
## 6 2579 132 6 300
## 7 2606 45 7 270
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 1 0 510
## 3 65 3180
## 4 118 3090
## 5 -1 391
## 6 -15 67
## 7 5 633
## Hillshade_9am Hillshade_Noon Hillshade_3pm
## 1 221 232 148
## 3 234 238 135
## 4 238 238 122
## 5 220 234 150
## 6 230 237 140
## 7 222 225 138
## Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2
## 1 6279 1 0
## 3 6121 1 0
## 4 6211 1 0
## 5 6172 1 0
## 6 6031 1 0
## 7 6256 1 0
## Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3
## 1 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 7 0 0 0 0 0
## Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## 1 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type15
## 1 0 0 0 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21
## 1 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27
## 1 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33
## 1 0 1 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 1 0 0 0
## 5 0 1 0 0 0 0
## 6 0 1 0 0 0 0
## 7 0 1 0 0 0 0
## Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## 1 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## Soil_Type40 targetVar
## 1 0 5
## 3 0 2
## 4 0 2
## 5 0 5
## 6 0 2
## 7 0 5
dim(Xy_train)
## [1] 10584 55
sapply(Xy_train, class)
## Elevation Aspect
## "integer" "integer"
## Slope Horizontal_Distance_To_Hydrology
## "integer" "integer"
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## "integer" "integer"
## Hillshade_9am Hillshade_Noon
## "integer" "integer"
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## "integer" "integer"
## Wilderness_Area1 Wilderness_Area2
## "integer" "integer"
## Wilderness_Area3 Wilderness_Area4
## "integer" "integer"
## Soil_Type1 Soil_Type2
## "integer" "integer"
## Soil_Type3 Soil_Type4
## "integer" "integer"
## Soil_Type5 Soil_Type6
## "integer" "integer"
## Soil_Type7 Soil_Type8
## "integer" "integer"
## Soil_Type9 Soil_Type10
## "integer" "integer"
## Soil_Type11 Soil_Type12
## "integer" "integer"
## Soil_Type13 Soil_Type14
## "integer" "integer"
## Soil_Type15 Soil_Type16
## "integer" "integer"
## Soil_Type17 Soil_Type18
## "integer" "integer"
## Soil_Type19 Soil_Type20
## "integer" "integer"
## Soil_Type21 Soil_Type22
## "integer" "integer"
## Soil_Type23 Soil_Type24
## "integer" "integer"
## Soil_Type25 Soil_Type26
## "integer" "integer"
## Soil_Type27 Soil_Type28
## "integer" "integer"
## Soil_Type29 Soil_Type30
## "integer" "integer"
## Soil_Type31 Soil_Type32
## "integer" "integer"
## Soil_Type33 Soil_Type34
## "integer" "integer"
## Soil_Type35 Soil_Type36
## "integer" "integer"
## Soil_Type37 Soil_Type38
## "integer" "integer"
## Soil_Type39 Soil_Type40
## "integer" "integer"
## targetVar
## "factor"
summary(Xy_train)
## Elevation Aspect Slope
## Min. :1874 Min. : 0.0 Min. : 0.00
## 1st Qu.:2377 1st Qu.: 65.0 1st Qu.:10.00
## Median :2751 Median :126.0 Median :15.00
## Mean :2751 Mean :156.8 Mean :16.48
## 3rd Qu.:3108 3rd Qu.:260.0 3rd Qu.:22.00
## Max. :3846 Max. :360.0 Max. :50.00
##
## Horizontal_Distance_To_Hydrology Vertical_Distance_To_Hydrology
## Min. : 0.0 Min. :-134.00
## 1st Qu.: 60.0 1st Qu.: 4.00
## Median : 180.0 Median : 32.00
## Mean : 226.4 Mean : 50.53
## 3rd Qu.: 323.2 3rd Qu.: 79.00
## Max. :1343.0 Max. : 554.00
##
## Horizontal_Distance_To_Roadways Hillshade_9am Hillshade_Noon
## Min. : 0 Min. : 58.0 Min. : 99.0
## 1st Qu.: 768 1st Qu.:196.0 1st Qu.:207.0
## Median :1317 Median :220.0 Median :223.0
## Mean :1714 Mean :212.7 Mean :219.1
## 3rd Qu.:2263 3rd Qu.:235.0 3rd Qu.:235.0
## Max. :6836 Max. :254.0 Max. :254.0
##
## Hillshade_3pm Horizontal_Distance_To_Fire_Points Wilderness_Area1
## Min. : 0.0 Min. : 30 Min. :0.0000
## 1st Qu.:107.0 1st Qu.: 732 1st Qu.:0.0000
## Median :138.0 Median :1256 Median :0.0000
## Mean :135.2 Mean :1515 Mean :0.2367
## 3rd Qu.:167.0 3rd Qu.:1992 3rd Qu.:0.0000
## Max. :247.0 Max. :6993 Max. :1.0000
##
## Wilderness_Area2 Wilderness_Area3 Wilderness_Area4 Soil_Type1
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.03231 Mean :0.4228 Mean :0.3082 Mean :0.0239
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Soil_Type2 Soil_Type3 Soil_Type4 Soil_Type5
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.04176 Mean :0.06349 Mean :0.05678 Mean :0.01058
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## Min. :0.00000 Min. :0 Min. :0.00e+00 Min. :0.0000000
## 1st Qu.:0.00000 1st Qu.:0 1st Qu.:0.00e+00 1st Qu.:0.0000000
## Median :0.00000 Median :0 Median :0.00e+00 Median :0.0000000
## Mean :0.04091 Mean :0 Mean :9.45e-05 Mean :0.0006614
## 3rd Qu.:0.00000 3rd Qu.:0 3rd Qu.:0.00e+00 3rd Qu.:0.0000000
## Max. :1.00000 Max. :0 Max. :1.00e+00 Max. :1.0000000
##
## Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.1419 Mean :0.02731 Mean :0.01455 Mean :0.03193
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Soil_Type14 Soil_Type15 Soil_Type16 Soil_Type17
## Min. :0.00000 Min. :0 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.00000 Median :0 Median :0.000000 Median :0.00000
## Mean :0.01039 Mean :0 Mean :0.008031 Mean :0.04034
## 3rd Qu.:0.00000 3rd Qu.:0 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.00000 Max. :0 Max. :1.000000 Max. :1.00000
##
## Soil_Type18 Soil_Type19 Soil_Type20
## Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.003874 Mean :0.003212 Mean :0.009448
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000
##
## Soil_Type21 Soil_Type22 Soil_Type23 Soil_Type24
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.001323 Mean :0.02353 Mean :0.05102 Mean :0.01635
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Soil_Type25 Soil_Type26 Soil_Type27
## Min. :0.00e+00 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.00e+00 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.00e+00 Median :0.000000 Median :0.0000000
## Mean :9.45e-05 Mean :0.003401 Mean :0.0008503
## 3rd Qu.:0.00e+00 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.00e+00 Max. :1.000000 Max. :1.0000000
##
## Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31
## Min. :0.0000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.0002834 Mean :0.08626 Mean :0.04639 Mean :0.02173
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Soil_Type32 Soil_Type33 Soil_Type34 Soil_Type35
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.000000 Median :0.000000
## Mean :0.04639 Mean :0.04101 Mean :0.001417 Mean :0.007086
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.000000
##
## Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## Min. :0.0000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.0006614 Mean :0.00189 Mean :0.04639 Mean :0.04403
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Soil_Type40 targetVar
## Min. :0.00000 1:1512
## 1st Qu.:0.00000 2:1512
## Median :0.00000 3:1512
## Mean :0.03071 4:1512
## 3rd Qu.:0.00000 5:1512
## Max. :1.00000 6:1512
## 7:1512
sapply(Xy_train, function(x) sum(is.na(x)))
## Elevation Aspect
## 0 0
## Slope Horizontal_Distance_To_Hydrology
## 0 0
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 0 0
## Hillshade_9am Hillshade_Noon
## 0 0
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## 0 0
## Wilderness_Area1 Wilderness_Area2
## 0 0
## Wilderness_Area3 Wilderness_Area4
## 0 0
## Soil_Type1 Soil_Type2
## 0 0
## Soil_Type3 Soil_Type4
## 0 0
## Soil_Type5 Soil_Type6
## 0 0
## Soil_Type7 Soil_Type8
## 0 0
## Soil_Type9 Soil_Type10
## 0 0
## Soil_Type11 Soil_Type12
## 0 0
## Soil_Type13 Soil_Type14
## 0 0
## Soil_Type15 Soil_Type16
## 0 0
## Soil_Type17 Soil_Type18
## 0 0
## Soil_Type19 Soil_Type20
## 0 0
## Soil_Type21 Soil_Type22
## 0 0
## Soil_Type23 Soil_Type24
## 0 0
## Soil_Type25 Soil_Type26
## 0 0
## Soil_Type27 Soil_Type28
## 0 0
## Soil_Type29 Soil_Type30
## 0 0
## Soil_Type31 Soil_Type32
## 0 0
## Soil_Type33 Soil_Type34
## 0 0
## Soil_Type35 Soil_Type36
## 0 0
## Soil_Type37 Soil_Type38
## 0 0
## Soil_Type39 Soil_Type40
## 0 0
## targetVar
## 0
cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
## freq percentage
## 1 1512 14.28571
## 2 1512 14.28571
## 3 1512 14.28571
## 4 1512 14.28571
## 5 1512 14.28571
## 6 1512 14.28571
## 7 1512 14.28571
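Because the seven classes are perfectly balanced, a model that always predicts a single class would be right about one time in seven. The sketch below computes that reference point (sometimes called the no-information rate) from the class counts shown above, so that later accuracy figures can be read against it.

```r
# Class counts taken from the frequency table above.
class_counts <- setNames(rep(1512, 7), 1:7)

# Accuracy of always predicting the most frequent class.
no_info_rate <- max(class_counts) / sum(class_counts)
print(round(no_info_rate * 100, 2))  # 14.29
```

Any model worth keeping should therefore score well above 14.29% accuracy on this data.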
# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
  boxplot(X_train[,i], main=names(X_train)[i])
}
# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
  hist(X_train[,i], main=names(X_train)[i])
}
# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
  plot(density(X_train[,i]), main=names(X_train)[i])
}
# Scatterplot matrix colored by class
# pairs(targetVar~., data=Xy_train, col=Xy_train$targetVar)
# Box and whisker plots for each attribute by class
# scales <- list(x=list(relation="free"), y=list(relation="free"))
# featurePlot(x=X_train, y=y_train, plot="box", scales=scales)
# Density plots for each attribute by class value
# featurePlot(x=X_train, y=y_train, plot="density", scales=scales)
# Correlation plot
correlations <- cor(X_train)
## Warning in cor(X_train): the standard deviation is zero
corrplot(correlations, method="circle")
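The warning from `cor()` above is raised when a column is constant, since its standard deviation is zero and its correlations are undefined. In the training summary, Soil_Type7 and Soil_Type15 have a maximum of 0, so they are the likely cause. One possible way to handle this, sketched here on toy data rather than taken from the original script, is to drop constant columns before computing correlations.

```r
# Toy data: two varying columns and one constant column.
toy <- data.frame(
  Elevation  = c(2596, 2590, 2804, 2785),
  Slope      = c(3, 2, 9, 18),
  Soil_Type7 = c(0, 0, 0, 0)   # constant, like Soil_Type7 in the training set
)

# Flag columns that contain a single distinct value.
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))
print(names(toy)[is_constant])

# Correlations on the remaining columns produce no warning and no NA entries.
correlations <- cor(toy[, !is_constant, drop = FALSE])
print(round(correlations, 3))
```

caret also provides `nearZeroVar()` for a more general version of this check. The base R approach is used here only to keep the sketch dependency-free.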
if (notifyStatus) email_notify(paste("Data Summarization and Visualization completed!",date()))
Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Some data-prep tasks might include:
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation has begun!",date()))
# Not applicable for this iteration of the project.
# Not applicable for this iteration of the project.
# Not applicable for this iteration of the project.
dim(Xy_train)
## [1] 10584 55
dim(Xy_test)
## [1] 4536 55
sapply(Xy_train, class)
## Elevation Aspect
## "integer" "integer"
## Slope Horizontal_Distance_To_Hydrology
## "integer" "integer"
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## "integer" "integer"
## Hillshade_9am Hillshade_Noon
## "integer" "integer"
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## "integer" "integer"
## Wilderness_Area1 Wilderness_Area2
## "integer" "integer"
## Wilderness_Area3 Wilderness_Area4
## "integer" "integer"
## Soil_Type1 Soil_Type2
## "integer" "integer"
## Soil_Type3 Soil_Type4
## "integer" "integer"
## Soil_Type5 Soil_Type6
## "integer" "integer"
## Soil_Type7 Soil_Type8
## "integer" "integer"
## Soil_Type9 Soil_Type10
## "integer" "integer"
## Soil_Type11 Soil_Type12
## "integer" "integer"
## Soil_Type13 Soil_Type14
## "integer" "integer"
## Soil_Type15 Soil_Type16
## "integer" "integer"
## Soil_Type17 Soil_Type18
## "integer" "integer"
## Soil_Type19 Soil_Type20
## "integer" "integer"
## Soil_Type21 Soil_Type22
## "integer" "integer"
## Soil_Type23 Soil_Type24
## "integer" "integer"
## Soil_Type25 Soil_Type26
## "integer" "integer"
## Soil_Type27 Soil_Type28
## "integer" "integer"
## Soil_Type29 Soil_Type30
## "integer" "integer"
## Soil_Type31 Soil_Type32
## "integer" "integer"
## Soil_Type33 Soil_Type34
## "integer" "integer"
## Soil_Type35 Soil_Type36
## "integer" "integer"
## Soil_Type37 Soil_Type38
## "integer" "integer"
## Soil_Type39 Soil_Type40
## "integer" "integer"
## targetVar
## "factor"
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation completed!",date()))
proc.time()-startTimeScript
## user system elapsed
## 24.985 0.314 25.351
After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The typical evaluation tasks include:
For this project, we will evaluate one linear, one non-linear, and three ensemble algorithms:
Linear Algorithm: Linear Discriminant Analysis
Non-Linear Algorithm: Decision Trees (CART)
Ensemble Algorithms: Bagged CART, Random Forest, and Gradient Boosting
The random number seed is reset before each run so that every algorithm is evaluated on the same data splits. This makes the results directly comparable.
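To illustrate why resetting the seed matters, the sketch below assigns rows to 10 folds twice with the same seed and confirms that the assignments are identical. It is a simplified stand-in: the helper name `make_folds` is hypothetical, and unlike caret's resampling these folds are not stratified by class.

```r
# Hypothetical helper: randomly assign each of n rows to one of k folds.
make_folds <- function(n, k = 10, seed = 888) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

folds_a <- make_folds(10584)
folds_b <- make_folds(10584)

# Same seed, same fold assignment, so models see the same train/validation splits.
print(identical(folds_a, folds_b))  # TRUE

# Each fold holds roughly one tenth of the rows.
print(table(folds_a))
```

Without the reset, each call would draw different folds and the accuracy differences between algorithms would partly reflect the luck of the split.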
# Linear Discriminant Analysis (Classification)
# if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling has begun!",date()))
# startTimeModule <- proc.time()
# set.seed(seedNum)
# fit.lda <- train(targetVar~., data=Xy_train, method="lda", metric=metricTarget, trControl=control)
# print(fit.lda)
# proc.time()-startTimeModule
# if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling completed!",date()))
# Decision Tree - CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Decision Tree modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=Xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART
##
## 10584 samples
## 54 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.09656085 0.4616631 0.37194057
## 0.13966049 0.3336852 0.22266751
## 0.16666667 0.2140159 0.08322312
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.09656085.
proc.time()-startTimeModule
## user system elapsed
## 5.880 1.074 5.779
if (notifyStatus) email_notify(paste("Decision Tree modeling completed!",date()))
In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.
# Bagged CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Bagged CART modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=Xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART
##
## 10584 samples
## 54 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8356022 0.8082022
proc.time()-startTimeModule
## user system elapsed
## 93.253 25.123 89.223
if (notifyStatus) email_notify(paste("Bagged CART modeling completed!",date()))
# Random Forest (Regression/Classification)
if (notifyStatus) email_notify(paste("Random Forest modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest
##
## 10584 samples
## 54 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6486189 0.5900545
## 28 0.8548767 0.8306890
## 54 0.8467495 0.8212076
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 28.
proc.time()-startTimeModule
## user system elapsed
## 993.184 1.998 997.266
if (notifyStatus) email_notify(paste("Random Forest modeling completed!",date()))
# Gradient Boosting (Regression/Classification)
if (notifyStatus) email_notify(paste("Gradient Boosting modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, trControl=control, verbose=FALSE)
# fit.gbm <- train(targetVar~., data=Xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## eXtreme Gradient Boosting
##
## 10584 samples
## 54 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds Accuracy
## 0.3 1 0.6 0.50 50 0.6977556
## 0.3 1 0.6 0.50 100 0.7283655
## 0.3 1 0.6 0.50 150 0.7410261
## 0.3 1 0.6 0.75 50 0.6984144
## 0.3 1 0.6 0.75 100 0.7243995
## 0.3 1 0.6 0.75 150 0.7374395
## 0.3 1 0.6 1.00 50 0.6956708
## 0.3 1 0.6 1.00 100 0.7206196
## 0.3 1 0.6 1.00 150 0.7306347
## 0.3 1 0.8 0.50 50 0.7000185
## 0.3 1 0.8 0.50 100 0.7260981
## 0.3 1 0.8 0.50 150 0.7414052
## 0.3 1 0.8 0.75 50 0.6959566
## 0.3 1 0.8 0.75 100 0.7244945
## 0.3 1 0.8 0.75 150 0.7330902
## 0.3 1 0.8 1.00 50 0.6947281
## 0.3 1 0.8 1.00 100 0.7194849
## 0.3 1 0.8 1.00 150 0.7320524
## 0.3 2 0.6 0.50 50 0.7467871
## 0.3 2 0.6 0.50 100 0.7711631
## 0.3 2 0.6 0.50 150 0.7829738
## 0.3 2 0.6 0.75 50 0.7475453
## 0.3 2 0.6 0.75 100 0.7703133
## 0.3 2 0.6 0.75 150 0.7807045
## 0.3 2 0.6 1.00 50 0.7435765
## 0.3 2 0.6 1.00 100 0.7646469
## 0.3 2 0.6 1.00 150 0.7788188
## 0.3 2 0.8 0.50 50 0.7503782
## 0.3 2 0.8 0.50 100 0.7768320
## 0.3 2 0.8 0.50 150 0.7854270
## 0.3 2 0.8 0.75 50 0.7490550
## 0.3 2 0.8 0.75 100 0.7728660
## 0.3 2 0.8 0.75 150 0.7866596
## 0.3 2 0.8 1.00 50 0.7454659
## 0.3 2 0.8 1.00 100 0.7694636
## 0.3 2 0.8 1.00 150 0.7806110
## 0.3 3 0.6 0.50 50 0.7773073
## 0.3 3 0.6 0.50 100 0.8004541
## 0.3 3 0.6 0.50 150 0.8147203
## 0.3 3 0.6 0.75 50 0.7760776
## 0.3 3 0.6 0.75 100 0.8030984
## 0.3 3 0.6 0.75 150 0.8152895
## 0.3 3 0.6 1.00 50 0.7713515
## 0.3 3 0.6 1.00 100 0.7962051
## 0.3 3 0.6 1.00 150 0.8100919
## 0.3 3 0.8 0.50 50 0.7819346
## 0.3 3 0.8 0.50 100 0.8057452
## 0.3 3 0.8 0.50 150 0.8187843
## 0.3 3 0.8 0.75 50 0.7835419
## 0.3 3 0.8 0.75 100 0.8066893
## 0.3 3 0.8 0.75 150 0.8158560
## 0.3 3 0.8 1.00 50 0.7787234
## 0.3 3 0.8 1.00 100 0.8050859
## 0.3 3 0.8 1.00 150 0.8166105
## 0.4 1 0.6 0.50 50 0.7122081
## 0.4 1 0.6 0.50 100 0.7380996
## 0.4 1 0.6 0.50 150 0.7410263
## 0.4 1 0.6 0.75 50 0.7095653
## 0.4 1 0.6 0.75 100 0.7337520
## 0.4 1 0.6 0.75 150 0.7431052
## 0.4 1 0.6 1.00 50 0.7066330
## 0.4 1 0.6 1.00 100 0.7289316
## 0.4 1 0.6 1.00 150 0.7394195
## 0.4 1 0.8 0.50 50 0.7124911
## 0.4 1 0.8 0.50 100 0.7361138
## 0.4 1 0.8 0.50 150 0.7432004
## 0.4 1 0.8 0.75 50 0.7062557
## 0.4 1 0.8 0.75 100 0.7350735
## 0.4 1 0.8 0.75 150 0.7432928
## 0.4 1 0.8 1.00 50 0.7055938
## 0.4 1 0.8 1.00 100 0.7286497
## 0.4 1 0.8 1.00 150 0.7420664
## 0.4 2 0.6 0.50 50 0.7554825
## 0.4 2 0.6 0.50 100 0.7776826
## 0.4 2 0.6 0.50 150 0.7893041
## 0.4 2 0.6 0.75 50 0.7568038
## 0.4 2 0.6 0.75 100 0.7789119
## 0.4 2 0.6 0.75 150 0.7894001
## 0.4 2 0.6 1.00 50 0.7553856
## 0.4 2 0.6 1.00 100 0.7783440
## 0.4 2 0.6 1.00 150 0.7901538
## 0.4 2 0.8 0.50 50 0.7555748
## 0.4 2 0.8 0.50 100 0.7838269
## 0.4 2 0.8 0.50 150 0.7911963
## 0.4 2 0.8 0.75 50 0.7588795
## 0.4 2 0.8 0.75 100 0.7823107
## 0.4 2 0.8 0.75 150 0.7957289
## 0.4 2 0.8 1.00 50 0.7560503
## 0.4 2 0.8 1.00 100 0.7765509
## 0.4 2 0.8 1.00 150 0.7894943
## 0.4 3 0.6 0.50 50 0.7790993
## 0.4 3 0.6 0.50 100 0.8054634
## 0.4 3 0.6 0.50 150 0.8100921
## 0.4 3 0.6 0.75 50 0.7893069
## 0.4 3 0.6 0.75 100 0.8111303
## 0.4 3 0.6 0.75 150 0.8218104
## 0.4 3 0.6 1.00 50 0.7839180
## 0.4 3 0.6 1.00 100 0.8089546
## 0.4 3 0.6 1.00 150 0.8182170
## 0.4 3 0.8 0.50 50 0.7892103
## 0.4 3 0.8 0.50 100 0.8109439
## 0.4 3 0.8 0.50 150 0.8183156
## 0.4 3 0.8 0.75 50 0.7893061
## 0.4 3 0.8 0.75 100 0.8156677
## 0.4 3 0.8 0.75 150 0.8249252
## 0.4 3 0.8 1.00 50 0.7884536
## 0.4 3 0.8 1.00 100 0.8131161
## 0.4 3 0.8 1.00 150 0.8225619
## Kappa
## 0.6473803
## 0.6830925
## 0.6978627
## 0.6481495
## 0.6784647
## 0.6936779
## 0.6449482
## 0.6740552
## 0.6857392
## 0.6500199
## 0.6804464
## 0.6983051
## 0.6452818
## 0.6785758
## 0.6886043
## 0.6438483
## 0.6727313
## 0.6873935
## 0.7045833
## 0.7330225
## 0.7468023
## 0.7054685
## 0.7320313
## 0.7441547
## 0.7008385
## 0.7254209
## 0.7419544
## 0.7087737
## 0.7396364
## 0.7496643
## 0.7072297
## 0.7350091
## 0.7511019
## 0.7030424
## 0.7310401
## 0.7440456
## 0.7401909
## 0.7671956
## 0.7838393
## 0.7387568
## 0.7702809
## 0.7845038
## 0.7332420
## 0.7622385
## 0.7784401
## 0.7455896
## 0.7733689
## 0.7885812
## 0.7474643
## 0.7744697
## 0.7851644
## 0.7418430
## 0.7726000
## 0.7860452
## 0.6642406
## 0.6944478
## 0.6978625
## 0.6611582
## 0.6893764
## 0.7002879
## 0.6577378
## 0.6837522
## 0.6959879
## 0.6645718
## 0.6921313
## 0.7003988
## 0.6572968
## 0.6909180
## 0.7005071
## 0.6565249
## 0.6834234
## 0.6990763
## 0.7147283
## 0.7406289
## 0.7541875
## 0.7162702
## 0.7420636
## 0.7542991
## 0.7146153
## 0.7414004
## 0.7551784
## 0.7148358
## 0.7477964
## 0.7563951
## 0.7186915
## 0.7460279
## 0.7616828
## 0.7153914
## 0.7393090
## 0.7544093
## 0.7422818
## 0.7730395
## 0.7784401
## 0.7541905
## 0.7796515
## 0.7921116
## 0.7479033
## 0.7771130
## 0.7879193
## 0.7540776
## 0.7794340
## 0.7880343
## 0.7541897
## 0.7849448
## 0.7957457
## 0.7531952
## 0.7819681
## 0.7929882
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1
## and subsample = 0.75.
proc.time()-startTimeModule
## user system elapsed
## 3804.571 27.404 1933.923
if (notifyStatus) email_notify(paste("Gradient Boosting modeling completed!",date()))
results <- resamples(list(CART=fit.cart, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: CART, BagCART, RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.4018868 0.4288752 0.4806238 0.4616631 0.4890160 0.4962193 0
## BagCART 0.8205855 0.8321897 0.8361665 0.8356022 0.8419187 0.8478261 0
## RF 0.8452830 0.8518089 0.8553875 0.8548767 0.8570416 0.8638941 0
## GBM 0.8175803 0.8210157 0.8253076 0.8249252 0.8268779 0.8327032 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.3020993 0.3337927 0.3938647 0.3719406 0.4038891 0.4121922 0
## BagCART 0.7906879 0.8042202 0.8088612 0.8082022 0.8155709 0.8224647 0
## RF 0.8194995 0.8271086 0.8312837 0.8306890 0.8332144 0.8412100 0
## GBM 0.7871786 0.7911900 0.7961882 0.7957457 0.7980230 0.8048178 0
dotplot(results)
cat('The average accuracy from all models is:',
mean(c(results$values$`CART~Accuracy`,results$values$`BagCART~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)))
## The average accuracy from all models is: 0.7442668
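Because every model contributes the same number of resamples (10), the pooled mean computed above equals the mean of the four per-model mean accuracies. A quick check, using values copied from the `summary(results)` output:

```r
# Mean cross-validated accuracy per model, copied from the summary above.
mean_accuracy <- c(CART = 0.4616631, BagCART = 0.8356022,
                   RF = 0.8548767, GBM = 0.8249252)

# With equal resample counts, averaging the means matches averaging all 40 values.
overall_mean <- mean(mean_accuracy)
cat("Average accuracy across models:", round(overall_mean, 7), "\n")
```

The result agrees with the 0.7442668 reported above. The single-tree CART score pulls the average well below the three ensemble results.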
After shortlisting the machine learning algorithms that reach a good level of accuracy, we look for ways to improve them further.
Using two of the strongest algorithms from the previous section (Random Forest and Gradient Boosting), we will search for the combination of parameters for each algorithm that yields the best results.
Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.
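For reference, `expand.grid()` returns a data frame with one row per combination of the supplied values, and caret evaluates each row as one candidate model. A small base-R sketch, reusing the grid values from the Gradient Boosting tuning step later in this section, shows the shape of such a grid (the name `tuning_grid` is used here only for illustration):

```r
# One varying parameter (nrounds) and six parameters held at a single value.
tuning_grid <- expand.grid(nrounds = c(100, 200, 300, 400, 500),
                           max_depth = 3,
                           eta = 0.4,
                           gamma = 0,
                           colsample_bytree = 0.8,
                           min_child_weight = 1,
                           subsample = 0.75)

# Each row is one parameter combination to be cross-validated.
print(dim(tuning_grid))
print(tuning_grid$nrounds)
```

Varying a second parameter multiplies the row count; three `max_depth` values, for example, would give 15 combinations, and the run time grows accordingly.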
# Tuning algorithm #1 - Random Forest
if (notifyStatus) email_notify(paste("Algorithm #1 tuning has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry = c(2,15,28,31,54))
fit.final1 <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)
print(fit.final1)
## Random Forest
##
## 10584 samples
## 54 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6516423 0.5935810
## 15 0.8534582 0.8290341
## 28 0.8546871 0.8304678
## 31 0.8548756 0.8306877
## 54 0.8466553 0.8210975
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 31.
proc.time()-startTimeModule
## user system elapsed
## 1736.863 2.197 1742.693
if (notifyStatus) email_notify(paste("Algorithm #1 tuning completed!",date()))
# Tuning algorithm #2 - Gradient Boosting
if (notifyStatus) email_notify(paste("Algorithm #2 tuning has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(nrounds=c(100,200,300,400,500), max_depth=3, eta=0.4, gamma=0, colsample_bytree=0.8, min_child_weight=1, subsample=0.75)
fit.final2 <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=FALSE)
plot(fit.final2)
print(fit.final2)
## eXtreme Gradient Boosting
##
## 10584 samples
## 54 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results across tuning parameters:
##
## nrounds Accuracy Kappa
## 100 0.8163281 0.7857156
## 200 0.8295570 0.8011495
## 300 0.8359819 0.8086454
## 400 0.8392873 0.8125019
## 500 0.8375882 0.8105197
##
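## Tuning parameter 'max_depth' was held constant at a value of 3
## Tuning parameter 'eta' was held constant at a value of 0.4
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.8
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 0.75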
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 400, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1
## and subsample = 0.75.
proc.time()-startTimeModule
## user system elapsed
## 579.533 3.092 293.621
if (notifyStatus) email_notify(paste("Algorithm #2 tuning completed!",date()))
results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.8489141 0.8519127 0.8534279 0.8548756 0.856771 0.8657845 0
## GBM 0.8241966 0.8384505 0.8413596 0.8392873 0.842155 0.8495743 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.8237380 0.8272304 0.8289985 0.8306877 0.8328957 0.8434159 0
## GBM 0.7948948 0.8115241 0.8149168 0.8125019 0.8158483 0.8245033 0
dotplot(results)
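The selection rule reported in the tuning output ("largest value" of Accuracy) can be reproduced by hand. The sketch below copies the Gradient Boosting results for each `nrounds` setting into a data frame (the names `tuning_results` and `best_row` are illustrative) and also shows how the gain shrinks as rounds are added:

```r
# Cross-validated results copied from the Gradient Boosting tuning output above.
tuning_results <- data.frame(
  nrounds  = c(100, 200, 300, 400, 500),
  Accuracy = c(0.8163281, 0.8295570, 0.8359819, 0.8392873, 0.8375882)
)

# Pick the row with the highest accuracy, mirroring the "largest value" rule.
best_row <- tuning_results[which.max(tuning_results$Accuracy), ]
print(best_row)

# Change in accuracy from each additional 100 boosting rounds.
accuracy_gain <- diff(tuning_results$Accuracy)
print(round(accuracy_gain, 4))
```

The gains get smaller with each step and turn negative between 400 and 500 rounds, which is why 400 was selected.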
Once we have narrowed the candidates down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as:
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
predictions <- predict(fit.final1, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 477 102 0 0 0 0 23
## 2 111 454 5 0 9 3 0
## 3 1 15 543 17 6 64 0
## 4 0 0 28 624 0 13 0
## 5 14 57 5 0 621 7 1
## 6 2 15 67 7 12 561 0
## 7 43 5 0 0 0 0 624
##
## Overall Statistics
##
## Accuracy : 0.8607
## 95% CI : (0.8503, 0.8706)
## No Information Rate : 0.1429
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8374
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.7361 0.7006 0.8380 0.9630 0.9583 0.8657
## Specificity 0.9678 0.9671 0.9735 0.9895 0.9784 0.9735
## Pos Pred Value 0.7924 0.7801 0.8406 0.9383 0.8809 0.8449
## Neg Pred Value 0.9565 0.9509 0.9730 0.9938 0.9930 0.9775
## Prevalence 0.1429 0.1429 0.1429 0.1429 0.1429 0.1429
## Detection Rate 0.1052 0.1001 0.1197 0.1376 0.1369 0.1237
## Detection Prevalence 0.1327 0.1283 0.1424 0.1466 0.1554 0.1464
## Balanced Accuracy 0.8520 0.8338 0.9057 0.9762 0.9684 0.9196
## Class: 7
## Sensitivity 0.9630
## Specificity 0.9877
## Pos Pred Value 0.9286
## Neg Pred Value 0.9938
## Prevalence 0.1429
## Detection Rate 0.1376
## Detection Prevalence 0.1481
## Balanced Accuracy 0.9753
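To make the headline statistics less of a black box, the following base-R sketch recomputes accuracy, Cohen's kappa, and per-class sensitivity directly from the Random Forest confusion matrix printed above. The counts are copied from that output; the variable names are illustrative, and the kappa formula used is the standard unweighted definition rather than code taken from caret.

```r
# Random Forest confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
class_labels <- as.character(1:7)
conf_mat <- matrix(c(477, 102,   0,   0,   0,   0,  23,
                     111, 454,   5,   0,   9,   3,   0,
                       1,  15, 543,  17,   6,  64,   0,
                       0,   0,  28, 624,   0,  13,   0,
                      14,  57,   5,   0, 621,   7,   1,
                       2,  15,  67,   7,  12, 561,   0,
                      43,   5,   0,   0,   0,   0, 624),
                   nrow = 7, byrow = TRUE,
                   dimnames = list(Prediction = class_labels,
                                   Reference = class_labels))

n_total <- sum(conf_mat)

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(conf_mat)) / n_total

# Sensitivity per class: correct predictions / actual members of the class.
sensitivity <- diag(conf_mat) / colSums(conf_mat)

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n_total^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Kappa = cohen_kappa), 4))
print(round(sensitivity, 4))
```

Since every class has the same number of test observations (648), the chance-agreement term is exactly 1/7, which matches the No Information Rate of 0.1429 in the output. Classes 1 and 2 have the lowest sensitivity because they are mostly confused with each other.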
predictions <- predict(fit.final2, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 456 108 0 0 1 1 31
## 2 145 448 9 0 20 9 1
## 3 1 15 520 13 8 86 0
## 4 0 0 27 626 0 12 0
## 5 14 59 14 0 610 4 1
## 6 1 14 78 9 9 536 0
## 7 31 4 0 0 0 0 615
##
## Overall Statistics
##
## Accuracy : 0.8402
## 95% CI : (0.8292, 0.8507)
## No Information Rate : 0.1429
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8135
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.7037 0.69136 0.8025 0.9660 0.9414 0.8272
## Specificity 0.9637 0.95267 0.9684 0.9900 0.9763 0.9715
## Pos Pred Value 0.7638 0.70886 0.8087 0.9414 0.8689 0.8284
## Neg Pred Value 0.9513 0.94877 0.9671 0.9943 0.9901 0.9712
## Prevalence 0.1429 0.14286 0.1429 0.1429 0.1429 0.1429
## Detection Rate 0.1005 0.09877 0.1146 0.1380 0.1345 0.1182
## Detection Prevalence 0.1316 0.13933 0.1418 0.1466 0.1548 0.1426
## Balanced Accuracy 0.8337 0.82202 0.8854 0.9780 0.9588 0.8993
## Class: 7
## Sensitivity 0.9491
## Specificity 0.9910
## Pos Pred Value 0.9462
## Neg Pred Value 0.9915
## Prevalence 0.1429
## Detection Rate 0.1356
## Detection Prevalence 0.1433
## Balanced Accuracy 0.9700
startTimeModule <- proc.time()
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(seedNum)
# Combining the training and test datasets to form the original dataset that will be used for training the final model
xy_complete <- rbind(Xy_train, Xy_test)
# finalModel <- randomForest(targetVar~., xy_complete, mtry=31, na.action=na.omit)
# summary(finalModel)
proc.time()-startTimeModule
## user system elapsed
## 0.024 0.001 0.025
#saveRDS(finalModel, "./finalModel_MultiClass.rds")
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
proc.time()-startTimeScript
## user system elapsed
## 7241.023 61.208 5090.206
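The commented-out `saveRDS()` call above indicates how the final model would be persisted. As a hedged sketch of that round trip, the snippet below saves and reloads a small stand-in model; the `lm` fit on the built-in `cars` data and the temporary file path are assumptions for illustration, not part of the original script.

```r
# Stand-in object; in the script this would be the fitted finalModel.
example_model <- lm(dist ~ speed, data = cars)

# Hypothetical file location (a temp file keeps the example free of side effects).
model_path <- tempfile(fileext = ".rds")
saveRDS(example_model, model_path)

# Reload, as a later session would, and confirm the predictions are unchanged.
restored_model <- readRDS(model_path)
same_predictions <- all.equal(predict(example_model, cars),
                              predict(restored_model, cars))
print(same_predictions)

# Clean up the temporary file.
unlink(model_path)
```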